Raw data quality control
Initialization
We start by loading the packages required for the analyses in this section. We also define the root directory under which all input/output operations for this project are performed. Detailed software version information is provided at the end of this document to make the analysis easier to reproduce.
library(reshape2)
library(ggplot2)
library(ggrepel)
library(ggpubr)
library(lsr)
library(plotly)
library(DT)
library(WriteXLS)
library(ggallin)
path = "/Users/ashwin/Documents/Projects/YeastScreen/Nonessential_MATa_screen/"
Cleaning raw data
First, we read the raw yeast redox screen data and remove outliers (extreme values) and NA values. Outlier removal is generally not considered good data-analysis practice; we do it here because we know that the extreme values come from technical artifacts. For example, roGFP2 ratios cannot be negative, and ratios far above the plate average are ambiguous signals.
Outliers were defined per plate as -
- upper outliers: values > 75th percentile of the plate + 3 * plate IQR
- lower outliers: values < 25th percentile of the plate - 3 * plate IQR
All outlier values were set to NA. Values above the lower-bound threshold but below 0 were set to a pseudo-minimum of 0.0001, because roGFP2 ratios cannot be negative.
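As an illustration of the rule above, here is a minimal sketch applied to a made-up vector of ratios for a single plate. The values and the variable names (`vals`, `lb`, `ub`, `cleaned`) are hypothetical and are not taken from the screen.

```r
# Hypothetical roGFP2 ratios for one plate (not real screen data)
vals <- c(-7491, -0.02, 0.31, 0.36, 0.45, 0.49, 0.52, 0.64, 0.70, 1122, NA)

# Per-plate fences: quartiles -/+ 3 * IQR
q   <- quantile(vals, probs = c(0.25, 0.75), na.rm = TRUE)
iqr <- IQR(vals, na.rm = TRUE)
lb  <- q[[1]] - 3 * iqr
ub  <- q[[2]] + 3 * iqr

cleaned <- vals
# Values beyond either fence are treated as outliers and set to NA
cleaned[!is.na(cleaned) & (cleaned < lb | cleaned > ub)] <- NA
# Remaining negative values are replaced by the pseudo-minimum
cleaned[!is.na(cleaned) & cleaned < 0] <- 1e-4

print(cleaned)
```

In this toy vector the two extreme values (-7491 and 1122) fall outside the fences and become NA, while the small negative value (-0.02) lies inside the fences and is replaced by 0.0001.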
The extreme outliers in the raw data are shown below by plotting the distribution of values from each plate.
rawDat = readRDS(paste0(path, "data/workspaces/YeastMutantRedox_RawData.RDS"))
tmpdat = na.omit(rawDat)
ggplot(tmpdat, aes(x = roGFP2.ratio, color = Plate)) +
  geom_density() +
  theme_bw(base_size = 8) +
  labs(x = "roGFP2 ratio (in log10)") +
  scale_x_continuous(trans = pseudolog10_trans, breaks = c(-50, -2:5, 50)) +
  facet_wrap(Content ~ Type, scales = "free") +
  theme(legend.position = "none", panel.grid = element_blank())
rawDatCleaned = vector("list", length = 3)
names(rawDatCleaned) = c("Glucose", "Galactose", "Glycerol")
for (i in names(rawDatCleaned)) {
  # Clean each plate of this growth condition separately
  dat = rawDat[rawDat$Type == i, ]
  dat = split(dat, dat$Plate)
  dat = lapply(dat, function(x) {
    vals = x$roGFP2.ratio
    # Per-plate fences: quartiles -/+ 3 * IQR
    lb = quantile(vals, probs = 0.25, na.rm = TRUE) - (3 * IQR(vals, na.rm = TRUE))
    ub = quantile(vals, probs = 0.75, na.rm = TRUE) + (3 * IQR(vals, na.rm = TRUE))
    # Values beyond the fences are outliers
    vals[vals < lb | vals > ub] = NA
    # Remaining negative ratios are set to the pseudo-minimum
    vals[vals > lb & vals < 0] = 10^-4
    x$roGFP2.ratio = vals
    # Drop rows whose ratio is NA
    x = x[!is.na(x$roGFP2.ratio), ]
    return(x)
  })
  rawDatCleaned[[i]] = do.call("rbind", dat)
  rm(dat)
}
rawDatCleaned = do.call("rbind", rawDatCleaned)
rawDatCleaned = droplevels(rawDatCleaned)
rownames(rawDatCleaned) = 1:nrow(rawDatCleaned)
Next, let's check the summaries of the raw roGFP2 redox screen data and the cleaned data for comparison.
Raw data looked like -
     Plate           Well_96             Well             Group
102    :  4512   E10    :  2496   Q13    :   156   AZ     :  2496
105    :  4512   E4     :  2496   Q14    :   156   BE     :  2496
108    :  4512   E9     :  2496   Q15    :   156   BF     :  2496
109    :  4512   A1     :  2448   Q16    :   156   A      :  2448
110    :  4512   A10    :  2448   Q33    :   156   AB     :  2448
112    :  4512   A11    :  2448   Q34    :   156   AD     :  2448
(Other):200736   (Other):212976   (Other):226872   (Other):212976

        Content      SystematicName        SGD.ID          Gene.Symbol
Blank       :56952   Length:227808      Length:227808      Length:227808
Control     :56952   Class :character   Class :character   Class :character
Cytoplasm   :56952   Mode  :character   Mode  :character   Mode  :character
Mitochondria:56952

  roGFP2.ratio            Type
Min.   :-7491.00   Glucose  :75936
1st Qu.:    0.36   Galactose:75936
Median :    0.49   Glycerol :75936
Mean   :    0.48
3rd Qu.:    0.64
Max.   : 1122.00
NA's   :56952
Cleaned data looked like -
     Plate           Well_96             Well             Group
102    :  3383   E4     :  1863   S35    :   156   AZ     :  1863
105    :  3382   E9     :  1863   T15    :   156   BE     :  1863
110    :  3375   E10    :  1855   T16    :   156   BF     :  1855
136    :  3372   E6     :  1834   Q33    :   155   BB     :  1834
116    :  3370   E8     :  1833   Q37    :   155   BD     :  1833
113    :  3356   D10    :  1832   R13    :   155   AT     :  1832
(Other):147811   (Other):156969   (Other):167116   (Other):156969

        Content      SystematicName        SGD.ID          Gene.Symbol
Control     :55994   Length:168049      Length:168049      Length:168049
Cytoplasm   :56174   Class :character   Class :character   Class :character
Mitochondria:55881   Mode  :character   Mode  :character   Mode  :character

  roGFP2.ratio          Type
Min.   :0.0000   Glucose  :55858
1st Qu.:0.3569   Galactose:56129
Median :0.4843   Glycerol :56062
Mean   :0.5106
3rd Qu.:0.6370
Max.   :3.4600
Percentage of rows removed by cleaning (outliers and NA values combined; the 56,952 NA rows alone make up 25% of the raw data) -
[1] 26.23
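The code that produced this percentage is not shown, so the following is only a sketch of how the figure can be recomputed from the row counts printed in the two summaries above. The variable names are hypothetical. It also separates the rows lost as NA from those removed as outliers.

```r
# Row counts copied from the summaries of the raw and cleaned data
n_raw     <- 227808  # rows in the raw data
n_na      <- 56952   # rows with an NA ratio in the raw data
n_cleaned <- 168049  # rows remaining after cleaning

# Overall share of rows dropped (NA rows and outliers together)
pct_removed <- round(100 * (n_raw - n_cleaned) / n_raw, 2)

# Share of measured (non-NA) rows dropped as outliers
n_outliers   <- (n_raw - n_na) - n_cleaned
pct_outliers <- round(100 * n_outliers / (n_raw - n_na), 2)

print(c(pct_removed = pct_removed, pct_outliers = pct_outliers))
```

Under these counts, most of the 26.23% comes from the NA rows. The outlier rule itself removes 2,807 rows, about 1.64% of the measured values.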
Below is the table of cleaned data, which is used in all subsequent analyses.